Data Collection

Data is all around us: think about weather forecasts, traffic data on Google Maps, statistics of your favourite sports teams, or even the number of likes on an Instagram post. Every day, we use this data to make decisions. We use weather data to decide what to wear; traffic data to decide which route to take to university; sports statistics to decide if our team has a chance to win their next match; social media data to decide if an influencer is popular or not; and much more.

Figure 1: Sources of data used for every day decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/womanFigure 1: Sources of data used for every day decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/womanFigure 1: Sources of data used for every day decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/womanFigure 1: Sources of data used for every day decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/woman

Figure 1: Sources of data used for every day decisions. Image attribution: Weather reporter from https://www.pngwing.com/; Traffic data from Google Maps; Rugby data from ESPN via Wikipedia; Influencer from https://www.vecteezy.com/free-vector/woman

Any statistical model or analysis is only as good as the input data. No matter how sophisticated the analysis, if the input data is poor, the output results will not be any good. This is known as the Garbage-In-Garbage-Out (GIGO) principle.

Figure 2: The GIGO principle

As good data analysts, it is our responsibility to ensure that, whenever we are involved in data collection, it is done correctly, in a systematic, accurate, and unbiased way. The goal of statistical data collection is to gather information that can be used to find patterns and trends, accurately answer questions, test hypotheses, and make evidence-based decisions.

The process of data collection is multi-faceted. It involves identifying what we want to study, choosing the right methods to collect the data (like surveys, experiments, observations, or finding secondary or tertiary data sources), and ensuring the information is reliable and representative. By learning about data collection, you will gain tools to make informed decisions in a variety of fields, from science and business to social issues. It will also equip you, as a data analyst, to help others to do so.

Q: Can you think of other sources of data that you use to make daily decisions? How reliable do you think these sources are?

Planning data collection

Before you start collecting data, it is crucial that you plan how you are going to do it. You will need to ask yourself (at least) the following questions:

  1. What is the question I am trying to answer?
  2. What/who is the population I am studying?
  3. What kind of information do I need from the population to answer this question?
  4. How can I obtain a dataset that will be representative of the information I need from this population?
  5. How can I obtain such a dataset in an ethical way?

Example 1: Raheem wants to open a new resaurant on Hatfield campus. Before doing this, he wants to know what the students think of the restaurants and fast food places already available on campus, in terms of the cost, freshness variety and taste of food. Following the data collection planning questions above, he gives the following answers:

  1. What is the question I am trying to answer? What are the perceptions of students on Hatfield campus about the cost, freshness, variety and taste of food already available on campus?
  2. What/who is the population I am studying? Students on Hatfield campus.
  3. What kind of information do I need from the population to answer this question? Answers from students about their perceptions of the available food options.
  4. How can I obtain a dataset that will be representative of the information I need from this population? I can conduct a survey of students on Hatfield campus.
  5. How can I obtain such a dataset in an ethical way? By obtaining the relevant permissions to conduct such a survey from the University, and to obtain clear consent from every student I ask, after I have explained the purpose of the survey. I will also keep the students’ answers anonymous.

Example 2: Tebogo is an analyst for a private security firm operating on the Hatfield City Improvement District (CID). Her security firm has recently employed a new strategy to combat theft. She wants to know whether thefts have decreased since they employed the strategy. She answers the questions as follows:

  1. What is the question I am trying to answer? Have thefts in the Hatfield CID decreased since our security firm employed its new theft-prevention strategy?
  2. What/who is the population I am studying? Everyone working in, living in, or travelling through the Hatfield CID.
  3. What kind of information do I need from the population to answer this question? Theft statistics from all relevant police stations whose precincts overlap with the CID.
  4. How can I obtain a dataset that will be representative of the information I need from this population? I can request crime statistics from the relevant police stations.
  5. How can I obtain such a dataset in an ethical way? By obtaining the relevant permissions from the SAPD, the relevant persons at the police stations themselves, and signing any necessary agreements about my use of the data.

Example 3: William is an ecologist who wants to determine if a new pesticide-free anti-fungal treatment he has developed will keep maize safe from fungal infections. Below are his answers to the data collection planning questions:

  1. What is the question I am trying to answer? Whether my new anti-fungal treatment works to protect maize from fungal infections.
  2. What/who is the population I am studying? Maize plants.
  3. What kind of information do I need from the population to answer this question? Data on the health of maize plants that were given the treatment, and maize plants that were not given the treatment, when exposed to fungi.
  4. How can I obtain a dataset that will be representative of the information I need from this population? By planting two fields of maize, giving one the anti-fungal treatment and leaving the other without treatment, and then exposing them both to the fungus.
  5. How can I obtain such a dataset in an ethical way? By making sure the fungus cannot spread to any other plants or crops.

Class Exercise Question 1: You are asked to determine the favourite movie of first-year mathematical sciences students in your class. Answer the data collection planning questions at the beginning of this section to determine what your intended dataset is, and how you will collect it.

  1. What is the question I am trying to answer?
  2. What/who is the population I am studying?
  3. What kind of information do I need from the population to answer this question?
  4. How can I obtain a dataset that will be representative of the information I need from this population?
  5. How can I obtain such a dataset in an ethical way?

Primary data collection

Primary data is collected first-hand by the researcher in order to answer a specific question or questions. Examples of primary data collection include conducting interviews and surveys to ask about people’s opinions and experiences; conducting experiments in a laboratory; collecting field data, such as animal tracking data; and taking direct measurements (e.g. the chemistry of plants, or the weight of animals).

Primary data is complex and resource-intensive to collect. It also requires an in-depth understanding of the answers to the data collection planning questions in the previous section. When you collect primary data, it is your responsibility to ensure that the correct data is collected in the correct way, and that the data is representative, unbiased, and ethical. We will learn more about representative and unbiased data in the section on evaluating data.

Exercise: For each of Examples 1-3 in the previous section, identify the type of primary data collected (survey, experiment, field data, or direct measurements), or indicate if it was not primary data.

Class Exercise Question 2: Is the dataset from Class Exercise Question 1 a primary, secondary or tertiary dataset? Can you obtain the data through surveys, interviews, experiments, field data, or direct measurements?

Secondary data collection

Secondary data is data that was collected by a different researcher for a purpose that is different from the current study. Examples of secondary data include data from the national census, marketing data collected by a company, crime data, social media data, and more.

Secondary data collection is usually done in one of the following ways:

  • By approaching the custodian of the primary data and obtaining their permission to use the data. This is usually done with sensitive data like public health or crime data.
  • By downloading an open dataset from the internet.
  • By webscraping or performing other techniques to gather data from the internet.

Q: Can you think of other examples of secondary data, and how you would collect them?

The main challenge when collecting secondary data is to make sure that it is the correct data to answer your research question. Even though secondary data was collected by someone else, you as the analyst still need to ensure that the data is of good quality, and ethical. Although the original data was not collected by you, you are still responsible for the ethics of the data as it pertains to your study. If the data was collected unethically, you could still face consequences for using it. This means that you cannot assume that the data is relevant and ethical.

Exercise: For each of Examples 1-3, if the data collected was not primary, identify the type of secondary data collected.

Evaluating Data

Once you have collected data (whether it is primary or secondary), you need to be able to determine if the data is good and fit for use. This section explains the attributes of a good dataset, and how to check if it is relevant to the research question at hand. The key aspects of a good dataset include relevance, quality, representativeness, unbiasedness, and impact.

Relevance

Relevant data is data that is applicable to the research question at hand. The analyst must be able to use this data to answer their research question. It must also be up-to-date for the purpose of the study. It is important to ensure that data is relevant, since irrelevant or redundant data can clutter the analysis and reduce the efficiency of the study.

For example, if an insurer wants to answer a question about short-term insurance in 2024, a dataset on long-term insurance in 2024 would be irrelevant. Similarly, a dataset on short-term insurance in 1996 would be irrelevant.

In order to ensure that primary data is relevant, one should collect only necessary data and regularly review datasets for alignment with the research objectives. For secondary data, one should determine the scope of the data and date of collection.

Quality

The data must be of good quality. This includes completeness and consistency.

A dataset is complete if it has minimal missing or incomplete data. Gaps in data can distort the analysis, or require assumptions that may not be valid.

In order to ensure that primary data is complete, one should clearly indicate and document missing data, and where possible, implement strategies to fill gaps responsibly. For secondary data, one should determine if any data is missing. If there is a large amount of missing data, this may indicate that the dataset is not suitable. If there are minimal missing values, one should use reliable techniques to impute missing data without making any undue assumptions.

As an example, missing data often arises in longitudinal health studies. These studies typically attempt to determine the status of patients over time. Missing values occur when patients do not show up for follow-up appointments and drop out of the study without explaining why. This can happen if they simply forget their follow-up appointments; if they feel better, and no longer feel it is necessary to visit the hospital; or if they move away; or for a host of other reasons. In such studies, it is therefore crucial to educate the participating patients on the need to attend their follow-up visits if they are able to.

A dataset is consistent if the data was recorded in a uniform and standardised manner across the dataset. Inconsistent data formats (e.g., different date formats or units) can complicate the analysis and increase the risk of errors.

For consistency of primary data, it is important to use standardised units of measurements (if applicable), standard questions with clear instructions on how to answer them (if applicable), standardised data formats, and data entry procedures. For secondary data, it is important to investigate whether the data formats and units are consistent across the dataset. If they are not consistent, this should be remedied before the data can be used.

Inconsistencies slip into datasets more easily than one might suppose. For example, if seven people are asked to write the date of New Year’s Eve in 2024, they might write it as follows:

  • 31/12/24
  • 31 Dec 24
  • 31st of Dec 2024
  • 31/12/2024
  • 31 December 2024
  • 12/31/2024 (This one is strange, but it follows the date format used in the USA. Sometimes, people’s phones or laptops might be set to the USA format by default, which can lead to these errors.)
  • 2024/12/31

To ensure consistent dates, for instance, one could provide an example of a date (e.g. 31/12/2024) or a standard date format (e.g. DD/MM/YYYY).

Representativeness and unbiasedness

The data must accurately represent the population being studied.

The data must avoid introducing bias. Bias occurs when the data over-represents some members of the population, and under-represents others, or if it represents some members of the population in an unduly positive or negative light.

Impact

The data should not have a potentially harmful impact.

Examples

Example 4 (continuation of Example 1): Raheem has finished collecting surveys from students about their perceptions of the food available on campus. He inspects his dataset using the key aspects above, and comes to the following conclusions:

  1. Relevance: Since this was primary data, it was collected by the investigator to answer his specific research question. The data is up-to-date. It is thus relevant.
  2. Quality: The surveys that were given to students were standardised. All students received a copy of the same survey, with clear instructions on how to answer each question. Thus, the dataset is consistent. Furthermore, nearly all of the respondents answered all the questions. Thus, the dataset is complete.
  3. Representativeness and unbiasedness: Students from all across Hatfield campus were asked to fill in the survey. This included students from different years of study, different degrees and different faculties, as well as diverse demographic and socio-economic backgrounds. Thus, the data represents the diverse student body on Hatfield campus. Furthermore, surveys were handed out to students at a variety of spots on campus, including far away from any food vendors, and regardless of whether or not students were eating purchased food, home-made food, or not eating at all. Thus, there was little if any bias.
  4. Impact: The data should not have a potentially harmful impact. The data was anonymised, so that students’ answers on the survey could not be linked to their identities in any way. Any mention of specific restaurants or food outlets was also removed, so that no student’s opinion could be linked to any existing vendor on Hatfield campus. Thus, there is very little chance of any potentially harmful impact on either students or food vendors.

Example 5 (continuation of Example 2):

  1. Relevance: .
  2. Quality: .
  3. Representativeness and unbiasedness: .
  4. Impact: .

Example 6 (continuation of Example 3):

  1. Relevance: .
  2. Quality: .
  3. Representativeness and unbiasedness: .
  4. Impact: .

Class Exercise Question 3:

Sampling

Concepts of sampling, estimating, bias, representativeness, sample size. Census versus Sample

This subsection introduces the differences between censuses and samples, and explains the need to take samples.

Probabilistic Sampling

Simple Random Sampling

Systematic Sampling

Stratified Sampling

Cluster Sampling

Non-Probabilistic Sampling

Convenience sampling